Syntactic Annotation for the Spoken Dutch Corpus Project (CGN)

Authors

  • Heleen Hoekstra
  • Michael Moortgat
  • Ineke Schuurman
  • Ton van der Wouden
Abstract

Of the ten million words of contemporary standard Dutch in the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN), a selection of one million words of natural spoken language will be annotated syntactically. In the present paper we discuss the tag sets and the annotation procedures that are currently being developed and tested. The annotation tags provide information about syntactic constituents and about the semantic relations (dependencies) between these constituents. The annotation graphs allow crossing branches, which makes it possible to represent dependencies independently of surface word order. Moreover, constituents can carry multiple dependency roles, a feature that is exploited in the annotation of non-local dependencies and ellipsis. The annotation process is carried out semi-automatically, using an interactive annotation environment developed within the NEGRA project, which built a syntactically annotated corpus of German newspaper texts. We illustrate the approach with some real-life examples from the CGN corpus, focusing on how some typical spoken language phenomena are dealt with.
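
The two graph properties mentioned in the abstract (crossing branches and constituents with more than one dependency role) can be illustrated with a minimal Python sketch. It is not the CGN or NEGRA implementation: the `Node` class, the helper functions, the category and role labels, and the four-word example sentence are all assumptions made up for illustration and are not taken from the corpus.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A word (terminal) or a constituent (non-terminal) in an annotation graph."""
    label: str                      # word form or category label (illustrative)
    position: Optional[int] = None  # surface position, set for words only
    children: list = field(default_factory=list)  # primary edges: (role, Node)

    def add(self, role, child):
        self.children.append((role, child))
        return child


def yield_positions(node):
    """Surface positions covered by a node, following primary edges only."""
    if node.position is not None:
        return {node.position}
    covered = set()
    for _, child in node.children:
        covered |= yield_positions(child)
    return covered


def is_discontinuous(node):
    """True if the node's words are not adjacent, i.e. its branches cross others."""
    pos = sorted(yield_positions(node))
    return any(b - a > 1 for a, b in zip(pos, pos[1:]))


def roles_of(target, root, secondary):
    """All (role, parent label) pairs for a node: primary edges, then secondary ones."""
    roles = []
    stack = [root]
    while stack:
        parent = stack.pop()
        for role, child in parent.children:
            if child is target:
                roles.append((role, parent.label))
            stack.append(child)
    roles += [(r, p.label) for r, p, c in secondary if c is target]
    return roles


# Invented example: "wat heeft hij gezien" ("what has he seen").
wat, heeft = Node("wat", 0), Node("heeft", 1)
hij, gezien = Node("hij", 2), Node("gezien", 3)

root = Node("S")        # hypothetical clause label
ppart = Node("PPART")   # hypothetical participle-phrase label
root.add("hd", heeft)
root.add("su", hij)
root.add("vc", ppart)
ppart.add("obj1", wat)  # "wat" belongs to the participle phrase despite its position
ppart.add("hd", gezien)

# Secondary edge: "hij" is also the subject of the participle phrase.
secondary_edges = [("su", ppart, hij)]

print(is_discontinuous(ppart))
print(roles_of(hij, root, secondary_edges))
```

In this sketch the participle phrase covers positions 0 and 3 only, so its branches cross those of the intervening words, and `hij` carries a role under both the clause and the participle phrase. Keeping secondary edges in a separate list is a design choice of the sketch: it leaves the primary structure a tree, so each constituent's yield stays well defined.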

Similar resources

Syntactic Analysis in the Spoken Dutch Corpus (CGN)

The paper describes the syntactic annotation of the Spoken Dutch Corpus (“Corpus Gesproken Nederlands” or CGN), the Dutch-Flemish project (1998-2003) aiming at the collection, description and annotation of ten million words of spoken Dutch. In the first part, the background of the parsing strategy is discussed, as well as some details concerning the actual implementation of the parsing process....

Large Scale Syntactic Annotation of Written Dutch: Lassy

The construction of a 500-million-word reference corpus of written Dutch has been identified as one of the priorities in the STEVIN programme. The focus is on written language in order to complement the Spoken Dutch Corpus (CGN) [13], completed in 2003. In D-COI (a pilot project funded by STEVIN), a 50-million-word pilot corpus has been compiled, parts of which were enriched with verified synta...

Using the Spoken Dutch Corpus for type-logical grammar induction

Abstract The dependency-based annotation format employed within the Spoken Dutch Corpus (CGN) project (van der Wouden et al., 2002) has been designed in such a way as to enable a transparent mapping to the derivational structures of current ‘lexicalized’ grammar formalisms. Through such translations, the CGN tree bank can be used to train and evaluate computational grammars within these framewo...

Harvesting Dutch Trees: Syntactic Properties of Spoken Dutch

In this paper, we report on quantitative research into certain word order phenomena in Dutch. In our research, we use the Spoken Dutch Corpus (CGN), a major new resource for research into contemporary spoken Dutch. After briefly introducing the primary data, the annotations added, and some of the tools to explore the primary data and the annotations, we illustrate how the Corpus may be utilized...

JASMIN-CGN: Extension of the Spoken Dutch Corpus with Speech of Elderly People, Children and Non-natives in the Human-Machine Interaction Modality

Large speech corpora (LSC) constitute an indispensable resource for conducting research in speech processing and for developing real-life speech applications. In 2004 the Spoken Dutch Corpus (CGN) became available, a corpus of standard Dutch as spoken by adult natives in the Netherlands and Flanders. Owing to budget constraints, CGN does not include speech of children, non-natives, elderly peop...

Publication date: 2000